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Abstract 

The estimation of the covariance matrix is an initial step in many multi- 
variate statistical methods such as principal components analysis and factor 
analysis, but in many practical applications the dimensionality of the sample 
space is large compared to the number of samples, and the usual maximum 
likelihood estimate is poor. Typically, improvements are obtained by mod- 
elling or regularization. From a practical point of view, these methods are 
often computationally heavy and rely on approximations. As a fast substi- 
tute, we propose an easily calculable maximum a posteriori (MAP) estimator 
based on a new class of prior distributions generalizing the inverse Wishart 
prior, discuss its properties, and demonstrate the estimator on simulated and 
real data. 

Keywords: Covariance estimation, Bayesian method, maximum a 
posteriori, inverse Wishart distribution, Tracy- Widom distribution 



1. Introduction 

The problem of estimating a large covariance matrix with limited amounts 
of data occurs in many different applications of statistics such as image anal- 
ysis, functional data analysis, quantitative finance, analysis of microarray 
data etc. We became interested in this problem through the study of shape 
variations in medical applications, e.g. X-ray images of human vertebra [3]. 
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To study the shape variation in such data, images are annotated by a medi- 
cal expert, and in the case of the vertebra 50 anatomically meaningful points 
were set on each 2 dimensional X-ray image, such that each shape is rep- 
resented by a 100 dimensional vector. For such a high-dimensional space, 
the standard ML covariance matrix estimate requires in the order of 1000 
annotated images to be of reasonable accuracy Unfortunately, this is rarely 
available, since the annotation task is laboursome and medical experts are a 
limiting resource. Therefore, we have been looking into improved estimates 
for small samples of high dimension. 

In this paper we propose a maximum a posteriori (MAP) estimator for 
the unknown covariance matrix based on a new class of prior distributions, 
which we call the power inverse Wishart distributions. We introduce the 
distributions in section [2] and derive the MAP estimator in section [3] We 
compare its properties with those of the usual inverse Wishart MAP estima- 
tor in section [4], derive some asymptotic results in section [5} and demonstrate 
its applicability on simulated (section [6]) as well as on real data (section [7]) . 

2. The Power Inverse Wishart Distribution 

We start by defining a class of distributions on the set of positive defi- 
nite p x p-matrices. This class generalizes the well-known inverse Wishart 
distribution and, as we will argue in the following section, leads to tractable 
MAP estimators of an unknown covariance matrix of a multivariate normal 
distribution. 

Definition 1. The power inverse Wishart distribution with parameters (\I/, m, q), 
where ^ is a positive definite p x p-matrix, m > p, and q G {1,2,...}, is the 
distribution on the set of positive definite p x p-matrices with density given 
by 



W q (B\V,m) = — exp (--tr ( U?- 1 / 2 BVf- 1/2 ^ * 



where c mq is a normalization constant given by 



\qm/2 
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exp ( --tr (B-«) ) \B\- qm/2 - p/2 - 1/2 dB, (2) 
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where the integral is over the set of positive definite p x p-matrices. 
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The distribution is well-defined, when the integral in (|2]) is finite; we 
show this in the following theorem. For q — 1, the power inverse Wishart 
distribution is the well-known inverse Wishart distribution with density 



W-\B\V,m) 




(3) 



where T p Oj ) = Tc p ( p 1 ^ 4 Yi^=i r ( y — J is the multivariate gamma func- 



tion. For p = 1 the power inverse Wishart distribution is the distribution of 



Theorem 1. The function given in Q is a density on the set of positive 
definite p x p-matrices. 

As a preliminary for the proof, recall that any positive definite matrix C 
has a positive definite gth root given by C 1/q = VcA^Vj where V c is a 
orthonormal matrix diagonalizing C, A = V^CVc is the diagonal matrix 
of eigenvalues of C and A. 1 ^ is the diagonal matrix with the gth root of 
the eigenvalues of C in the diagonal (see, e.g. Mardia, Kent, and Bibby j9j 
Appendix A]). 

Proof It follows from Deemer and Olkin jH Theorem 3.7] that 



Thus it is sufficient to show that ([T]) is a density for = I. 

Let C be an inverse Wishart-distributed matrix with parameters / and 
m > p, and consider the density of the distribution of the positive definite 
gth root B = C 1/q of C, 



where J{B q , B) is the Jacobian matrix of the transformation h(B) = B q de- 
fined on the set of symmetric matrices. It follows from Magnus and Neudecker 
p. 438 & Lemma 4.5(vi)] that 



Y~ q where F/# ~ xf m 



W- q (B\I,m) = W- q (^ 1/2 B^ 1/2 \^,m) ■ * 1/2 




p+i 



W-\B q \I,m) ■ \J(B q ,B)\ 



\J(B q ,B)\ = q p \B\ q - l ll 



i<3 
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where X x > A 2 > . . . > X p > are the eigenvalues of .B.The last term may 
be bounded from below as follows: 



^ h_ _ \q-l 1 ~ i^j/^i) 9 _ V"7\ I \ V 



1=0 

i 5-1 / 1 \ 



> r;^ 

max z=0l ... )9 -i ) ^ V ' ; 



maX(=o,..,,-i j j max i=0 ,... i9 -i j j 



Thus 



\J{B q ,B)\ > const - \B 



9 -l+(g-l)(p-l)/2 



Hence the density of B = C 1 ^ bounds 



exp f-^tr |B| 



-qm/2-p/2-l/2 



up to a constant. It follows that ([I]) is integrable, and therefore it specifies a 
density □ 

The next result, which describes the standard (i.e. = I) power inverse 
Wishart distribution, follows directly from Anderson [1, Theorem 13.3.4]: 

Theorem 2. Suppose B is a power inverse Wishart ,m,q) -distributed 
p x p-matrix and let X\ > A 2 > . . . X p > denotes its eigenvalues and V 
the matrix containing its normalised eigenvectors chosen such that the first 
element of each column is non-negative. 

Then (Ai, . . . , X p ) and V are independent, the joint density of the eigen- 
values is 



9{Xu • • • ' Ap) - c m , q r p ( P /2) ' UU ^ +P+1)/2 ' 11 " 



•3)1 

i<j 



and V has the conditional Haar invariant distribution ( cf Anderson m Def- 
inition 13.3.1]). 
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The theorem says that the eigenvectors of a power inverse Wishart distributed 
matrix (including the inverse Wishart distribution) with = I have the 
same distribution as the eigenvectors of a Wishart distributed matrix with 
the same matrix-parameter. Hence the distributions differ in how the eigen- 
values are distributed. 

It follows from Mardia et al. P, Lemma 4.2.1] that the mode of the power 
inverse Wishart distribution is 



To compare the power inverse Wishart distribution to the inverse Wishart 
distribution, we look at the ratio 



Here Ai, . . . , A p denotes the eigenvalues of B. We see that as any A, — > 0, this 
ratio goes to 0. Thus used as a prior for an unknown positive definite matrix, 
the general power inverse Wishart distribution gives smaller credibility to 
small eigenvalues, than does the usual inverse Wishart prior, and this effect 
gets stronger for larger values of q. The behaviour of the ratio as A, — > oo is 
determined by the parameters mi and m q as well as by q: If qm q > mi, then 
the power inverse Wishart will penalise large eigenvalues harder, than the 
inverse Wishart does, whereas it will be more lenient if qm q < m\. If qm q = 
mi, then the ratio will approach a constant as Aj — > oo. Similar comments can 
be made in the case with a general in this case the eigenvalues Ai, . . . , \ p 
denotes the eigenvalues of \I> _1 / 2 _B\I>~ 1 / 2 . Thus ^ is a "scaling parameter" 
and determines the position of the distribution, whereas q determines the 
tail behaviour at the "lower tail" , and the product qm determines the upper 
tail behaviour. 

We illustrate the tail behaviour in figure [T] for p = 1 and in figure [2] for 
p = 2 by plotting the ratios or the level curves of the ratios of the power 
inverse Wishart density to the inverse Wishart density for selected values of 
the parameters. 





W- q {B\I,m q ) 



= const ■ 
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Figure 1: Ratio of the power inverse Wishart density to the inverse Wishart density for 
different values of the parameters. The left hand graph shows ratios for q = 2 and mi = 4, 
the right hand graph shows ratios for q = 4 and mi = 8. In both graphs ratios are given 
for m q = 1, 2 and 4. The ratios have been normalized to take the same value at A = 1. 




Figure 2: Ratio of the 2-power inverse Wishart density with m q = 2, 4 and 8 
to the inverse Wishart density with mi = 8. The ratios arc normalized to take 
the value one at (1,1). The level curves are drawn at 10 c with the value of c 
(-100, -10, -4, -2, -1, -0.5, 0, 0.2, 0.5, 1, 1.5, 2) denoted on the graphs. 



6 



3. Maximum A Posteriori Estimation 



Consider a random sample Xx, ■ ■ ■ , X n e W of n independent and iden- 
tically normally distributed p-dimensional random vectors, where both the 
mean vector /j, and the covariance matrix S are unknown. The covariance 
matrix S is symmetric, and we will assume it to be positive definite. Put 
X = - V™ , Xi and let 

71 £-^l = l L 



1 - 

S = -Y^{X i -X){X i -X) 



n . 

i=i 



denote the empirical covariance matrix. Then the likelihood function is given 
by 



L (fi, Yj\Xi, . . . , X n ) 



ewHEl 1 (X l -^- 1 (X l -^)) 



|£|™/ 2 

= |S|-"/ 2 exp (-|tr (S- 1 ^) 

■ exp (--(x - p) T ?:-\x - » 

Provided that n > p, the likelihood function has a unique maximum at 

fi = X, ± = S. 

If n < p the likelihood is unbounded, and in this case there is no max- 
imum likelihood estimate (MLE). Of course X and S may still be used as 
estimators, but the properties of these estimators are typically poor. In many 
applications it may also be problematic that S is not positive definite. This 
is also the case when using methods such as principal components analysis 
or factor analysis. Even if the intention here is to reduce dimensionality, we 
would generally not want the reduction to be based simply on insufficient 
amounts of data leading to a singular covariance matrix. Moreover, if p is 
much larger than n, then the largest eigenvalue of S may severely overesti- 
mate the largest eigenvalue of S even if n is large (see section [5]). One way 
of mending these problems is to put a prior distribution on the unknown 
parameters and use maximum a posteriori estimators. A standard choice of 
prior for S is the inverse Wishart distribution with parameters m). With 
an improper uniform prior on W for fi this leads to MAP estimators given 
by 

A = X, t= 1 (nS + 9). 

n + m + p + 1 
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Without prior knowledge, a simple choice for the hyperparameter would 
be al for some a. This leads to an estimator of S, which has the same 
eigenvectors as the MLE, but where the eigenvalues have been scaled down 
by n/(n + m + p + 1) and shifted upwards by a/(n + m + p + 1). Thus, 
every eigenvalue of S is regularized in the same way regardless of its size. In 
some applications it may be more reasonable to apply different amounts of 
regularization depending on the size of the eigenvalue. 

Instead of using an inverse Wishart prior for the unknown covariance 
matrix, X, we propose to use a power inverse Wishart distribution as prior. 
Keeping the improper uniform prior for fj,, the resulting posterior is given by 

7T (S, n\X u . . . , X n ) oc L (fi, S|Xi, . . . , X n ) ■ W-«(£|*,m) 
exp (-f tr (IT'S)) • exp (-§ (X - /i) T S" 1 (X - /J,)) 



oc 



1 1/2 



exp ^-itr^*- 1 ^*- 1 / 2 ) ^ 

|S|(n+p+ ? m)/2 



exp 
oc 



-itr f n*~ 1/2 5*- 1/2 • f*-V2 S ^-i/2\ * + (VV2 S ^-i/2^ " 
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(n+p+ 9 m+l)/2 



^f- 1 / 2 5]^f-l/2 

•exp (-^(X - tfV-^X - . 

Maximizing over \i gives us ft = X. In order to maximize over £ we put 
H = X, change parametrization to T = ^^ 1 / 2 S^ ,_1 / 2 j , and take logs 
to obtain 

T^Z(T)=log7r(S,/i|X 1 ,...,X n ) 

= -\ tr („*-V=»5*-V2 . T + T«) + n + P + * m + 1 log |T| 

+ const 

(5) 
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Differentiating wrt. T (see e.g. [8J Chapter 9]) gives us 
dl(T) = -- tr (n^- 1/2 S^- 1/2 dT + qT^dT 



n + p + am + 1 1 . -, .„, , . 

+ — ttzt tr (Y T dT) 

2 |Tf| 1 1 / 

= — tr ( (W~ 1/2 S*- 1/2 + qf 1 - 1 -{n + p + qm + l)T _1 j dTf) , 
which is 0, if 

n^-Va 5*-V2 + qT i-i _( n + p + qm + !) T -i = . (6) 

Differentiating again leads to 

d 2 l(T) = -- tr (dY T ( 9 (g - 1) Y^cHT + (n+p + gm+ l)T~ 2 dT)) , 

so that the Hessian is negative definite. Moreover, by replacing T in ^ by 
tT it is easily shown that for any fixed T the function t — > l(tT) tends to 
minus infinity as t tends to or infinity. Thus we may conclude that Z(T) 
has a unique maximizer, which solves (|6| or equivalently 

n qr-V 2 S&- 1/2 ■ Y + qT q - (n + p + qm + 1)1 = 0. (7) 

By transposing the terms of this equation, we see that any symmetric solu- 
tion, Y, to this equation will commute with \J> -1 / 2 SSI/ -1 / 2 . It follows that 
^-1/25^-1/2 and t 

are diagonalized by the same orthonormal matrix (see 
[TOl lc(iii)]), and consequently the ith eigenvalue A» of Y satisfies 



q\1 + n\i ^-Vzgr^-i/a j . Aj - (ra + p + gm + 1) = 0, (8) 

where Aj(\I>~ 1/2 S\I>~ 1/2 ) denotes the ith eigenvalue of \&- 1 / 2 S*i>~ 1/2 

Theorem 3. If we impose a power inverse Wishart prior distribution for 
S with parameters (m, q) and an improper uniform prior of [i, then the 
maximum a posteriori estimator of S is 

£ = &/*vA~ 1 V T *V 2 , (9) 

where A. is a diagonal matrix with the unique positive solutions to the equa- 
tions ffity in the diagonal, and V is an orthonormal matrix diagonalizing 
^-1/25^-1/2 



9 



Proof The polynomial in (J8|, 

X^qX q + n\i(^- 1/2 S^- 1/2 ) ■ X - (n + p + qm + 1), 

is negative for A = and goes to infinity as A — > oo. Furthermore, it is 
strictly increasing for A > so that ^ has exactly one positive solution. 
Hence A is well-defined. Moreover T = V T A.V clearly solves ([7]). It follows 
that 



The positive solution of (|8j) is easily found numerically; we know that it is 
unique, and by Cauchy's bound [2] it is bounded by 

1 + max (nAi(*~ 1/2 S*" 1/2 ), n + p + qm + /q. 



Hence, we may solve (|8j) by a numerical method such as bisection. In the 
case q = 2, the eigenvalue equations pi) have closed form solutions 



A" 1 = r[Xd^- 1/2 S^ 2 ) 

2{n + p + 2m + l) 



L , 1/9^ i/9xo n + p + 2m + 1 
+ \ XA^- 1/2 S^- 1/2 ) 2 + 8 - — — 



n 2 



It follows that when q = 2, then 



± 



2{n + p + 2m + 



_^5 + ^l/2^^-l/2 5 ^-l/2 



+8 n + P + 2m + 1 /V /2 ^ 

n 2 



which further simplifies to 



t = - U - -(s+(s 2 + 8a 2 n+P + * m+1 l) 1/2 ) (10) 

2(n + p + 2m + l)y \ n 2 ) 1 

when ^ = al. 
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4. Regularization: Floor and shrinkage 



In the previous section we derived the power inverse Wishart MAP, which 
includes the usual inverse Wishart MAP as a special case. In this section we 
will discuss and compare how the MAP estimators regularize the MLE. We 
will focus mainly on the case, where ^ is a diagonal matrix, as this allows 
us to give some concrete expressions, but we will also comment on results for 
more general choices of ^. 

When ^ = al we may write 

± = VAV T , (11) 

where the orthonormal matrix V diagonalizes S, and A is the diagonal 
matrix with diagonal elements given by the positive solutions to the equations 

(n+p + qm + 1)A| - nX { (S) Xj' 1 - qa q = 0. (12) 

In this case, the MLE and the various MAP estimators all share the same 
eigenspaces, i.e. they are diagonalised by the same orthonormal matrix V. 
The eigenvalues of the MAP estimators are the diagonal elements of A from 



(11), i.e. the solutions to the equations (12). Thus, the MAP estimators 
regularizes the eigenvalues of S, but leave the eigenvectors unchanged. Hence 
their difference is, how the eigenvalues are regularized. 

If A is an eigenvalue of S, then the corresponding eigenvalue for the inverse 
Wishart MAP estimator ^ is 

(nX + a), (13) 



n + m + p + 1 



and for the 2-power inverse Wishart MAP (10) we get 



( X + Jx* + 8 n + P + * m + 1 a2 I. (14) 



2(n + p + 2m + 1) IV n 

Hence, both MAP estimators regularize the MLE by imposing a lower limit 
for the eigenvalues, which we denote the floor, and shrinking large eigenval- 
ues by multiplying with a factor smaller than 1. In other words, both MAP 
estimators increase small eigenvalues and decrease large eigenvalues as com- 
pared to the MLE. We define the shrinkage as the limit of the regularized 
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eigenvalue divided by the corresponding unregularized eigenvalue as the lat- 
ter tends to infinity. Thus, the shrinkage is the (asymptotic) scaling of large 
eigenvalues performed by the MAP estimator, whereas the floor is the lower 
limit for small eigenvalues imposed by the MAP estimator. The floor and the 
shrinkage factor both improve the estimation: The floor serves to make the 
estimator positive definite, whereas shrinking is beneficial for the estimation 
of the largest eigenvalues, as these tend to be overestimated, when p is not 
negligible compared to n (see also the following section). 

For the inverse Wishart MAP, the floor and the shrinkage are 

a n 
and 



n+m+p+1 n+m+p+1 

respectively, whereas for the 2-power inverse Wishart the floor and shrinkage 
are 

2 a n 
and 



y/n + p + 2m + 1 ' n + p + 2m + l 

respectively. For general q the floor and shrinkage are 

a [ ) and (15) 

V n + p + qm + 1 J n + p + qm + 1 



respectively. The floor follows directly from (12), which also shows that 



Tt 

\ > — — —HS). (16) 

n + p + qm + 1 

Combining this with Cauchy's bound [2] 

Xi < 1 + max (a q q, n\i(S)) /(n + p + qm + 1), (17) 



we obtain the shrinkage given in (15) above. 

The inverse Wishart MAP regularizes the eigenvalues by applying a linear 
function to the eigenvalues of S; the power inverse Wishart MAP returns a 
strictly increasing and strictly convex function of the eigenvalues of S. For 



q = 2 this follows directly from the expression (14). For general q, the 
Implicit Function Theorem gives us 

d\ t n 



d\i(S) q(n +p + qm + 1) - n(q - l)Aj(5)/A, ; 
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which is positive by (16), so that the function is increasing. Differentiating 
again we obtain 

d 2 \ n 2 (q-l) 1 / Xi(S) d\i 



d\i(S) 2 (q(n + p + qm + l) - n(q - l)Xi(S)/Xi) 2 Xi \ \ d\(S 

which is positive, proving convexity. The convex regularization imposed by 
the power inverse Wishart prior has the effect that the difference between 
small eigenvalues after regularization is smaller than those between large 
eigenvalues. Thus the power inverse Wishart MAP regularizes eigenvalues 
differently depending on their sizes. 

We also note that with the same floor and shrinkage, the eigenvalues of 
a power inverse Wishart MAP will always be smaller than the eigenvalues of 



the inverse Wishart MAP. Moreover, as the value of the derivative (18) at 
zero is a decreasing function of q, the eigenvalue of a power inverse Wishart 
MAP corresponding to any specific eigenvalue of S is decreasing as a func- 
tion of the power q, when the floor and shrinkage are unchanged. 

It is difficult to extend these results to the general case, where \P is not 
of the form al, in a useful way. Clearly the results may be extended to 
results concerning the MAP estimator of \I>~ 1 / 2 E\t' -1 / 2 by replacing S with 
*~ 1/2 S*" 1/2 , S with ijf-y^ij/- 1 / 2 and putting a = 1. From this we see 
that the zth diagonal element of A is larger than (3 and smaller than 
/3+ 7 A i (*- 1/2 S'*- 1/2 ), where (3 and 7 are the floor and shrinkage respectively 



from (15) with a — 1. Thus in the usual ordering of positive semi-definite 



matrices we have 



(31 < A 1 < (31 + j A, 



where A is the diagonal matrix with the eigenvalues of ^ l ^ 2 S^ 1//2 in the 
diagonal. From this we obtain 

/?* < S < /3* + 75. (19) 

Thus in the general case, we may talk of a "matrix floor", (3*&, and also here 
there is a shrinkage effect, but the actual shrinkage may be smaller than the 
factor 7. 



The inequalities in (19) has two trivial consequences that may be worth 
pointing out. The first is that similar inequalities hold for the diagonal 
elements of the matrices, i.e. for the estimated variances. The second conse- 
quence is that the MAP estimator has moments of all orders. 
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5. Asymptotic results 

In a standard asymptotic set-up with m, and p fixed as n increases, 
the asymptotic behaviour of a power inverse Wishart MAP is the same as 
the asymptotic behaviour of the MLE. 

Theorem 4. Suppose that ^ , m, and p are fixed as n increases. Then the 
power inverse Wishart MAP S and its eigenvalues have the same asymptotic 
distributions as the MLE S. 

Proof First consider the case where \I/ = al. As the eigenvalues of S are 



bounded in probability by (17), it follows that S is bounded in probability. 
Hence, re- writing ([7]) as 



n + p + qm + £</-i _ a q q j. 
n I n 



it follows that S = S + op{\/ y/n) and the result follows. 

With a general fixed it follows that ^- l l 2 t^>- 1/2 = ^- l/2 S<b- 1/2 
+op(l/ y/ri), implying that also in this case the MAP estimator and the 
MLE have the same asymptotic distribution. 

The results concerning the eigenvalues follow by continuous mapping. □ 

The densities of the limiting distributions in the case where S = I are 
given in Anderson [TJ Theorem 13.3.5]. 

As indicated in the introduction, our main interest is in estimating the co- 
variance matrix in situations, where p is large compared to n. Assuming 
that the components of X,i are iid standard normal, and that both n and p 
increase such that n/p — > 7 G [0; 00], it is known that 

^max(^) f-^n,p 

where A max (>S') denotes the largest eigenvalue of S, and p niP and a nyP are 
given by 



- / , , x V3 ( 20 ) 

n + Jp (I 1 x 1 



n \ y/n up 
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converges in distribution to a Tracy- Widom distribution [HI E] . For the MAP 
estimators we show the following result: 

Theorem 5. Suppose that Xi, . . . ,X n are independent, standard normally 
distributed random variables. Let AmL denote the largest eigenvalue of the 
MAP estimator of S based on an power inverse Wishart prior with parame- 



ters (al,m,q). Then with fi n>p and a n>p as in (20), 



\(<J) n 

Amax n+p+qm+1 ^ n 'P 
n 

n+p a n,p 

converges in distribution to a Tracy- Widom distribution as n,p — >• oo such 
that n/p — > 7 G [0; oo], and m/p — > k G [1; oof. 

Proof The largest eigenvalue of the inverse Wishart MAP estimator is given 
by 



^mL = — — 1 ——r(n\ max (S) + a) 
n + m + p + 1 



Consequently, 



max n+m+p+1 ^ n 'P _ n+rn+p+1 ^mml" ) ~ ^n,p , n+m+p+1 



_ r + ., 

n+p+m ® n -,V n _|_p_|_ m C"n,p ra _|_p_|_ m &n,p 

converges to a Tracy- Widom distribution, as ncr n>p — > oo. 

A more indirect argument is needed for the general case. Recall that the 



eigenvalues solves (|8j), and that the derivative (18) is positive. This implies 
that AmL solves (pi) for Xi(S) = A max (<S). Hence, 



in 



+ P + qm + l) (\± - - W A max (S)) (ASL) 9 1 = a\ (21) 

\ n + p + qm + 1 J 



Write 



n . . n A max (5) - // n . 



n + p + qm + 1 n + p + qro + 1 

n 

n + p + gm + 1 



15 



and observe that the first term is op(l) whereas the second term converges 
to a positive constant. Thus by the lower bound (16) it follows that AmL is 



bounded away from in probability. Consequently, we obtain 

AmL - n+p+ ™ m+1 Amax('S') / a n , p 

~ u p — ; ; r~r = °p\*-> 



<?n, P \n+p + qm + l 



from (21), and hence 



max n+p+qm+l " n 'P _ n+p+<jm+l ^max^J — ^n,p 

. ^ r7. .r) „ i „ i L/ r, 



+ o P (l) 



n+p+qm n >P n+p+qm n >P 

converges to a Tracy- Widom distribution. □ 

Remark. Recall that m > p so that m must increase at least as fast as 
p. Hence in theorem [5j k cannot be smaller than 1. A finite value of k means 
that q increases at the same rate as p whereas k = oo would mean that q 
increases at a faster rate. Note that our result does not include this scenario. 

It follows from the proof of theorem [5] that 

A lim - u =i + 2 ^- qK 

^max T 11111 , , , -I H"n,p 1 ^ -, . . j 

n,p-s>oo n + p + qm + 1 1 + 7 + qn 

where the last term is interpreted as 0, if 7 equals 00. Thus, the maximal 
asymptotic bias is smaller than 1. We note that the asymptotic bias of the 
largest eigenvalue of the power inverse Wishart MAP is bounded, whereas 
the asymptotic bias of the largest eigenvalue of the MLE is unbounded. In 
cases where p < n (so that 7 > 1) we may actually choose q and m such that 
the asymptotic bias is 0. Furthermore, the rate of convergence of the largest 
eigenvalue of the power inverse Wishart MAP is never slower than the rate 
of convergence of the largest eigenvalue of the MLE. 

Remark. It is not obvious how to extend this result to the case, when ^ 
is not of the form al, since in this case the largest eigenvalue of the MAP 
estimator is not a simple function of the largest eigenvalue of the MLE. A 
related question is, what happens to the asymptotic results, when the co- 
variance matrix of the underlying normally distributed data is S instead of 
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I. In this case the largest eigenvalue of S l ^ 2 SH has an asymptotic 
Tracy- Widom distribution. As 

A m ax(S 1 2 SX 1 ^ 2 )A m i n (S) < A max (>S') < A max (S 1 ^ 2 S'S 1 ^ 2 )A max (S), 

the asymptotic distribution of A max (>S') depends on how the eigenvalues of £ 
depends on p. 

6. Simulations 

To investigate the finite sample behaviour of our estimators we report 
on a small simulation study We only consider the MLE, the usual inverse 
Wishart MAP and a power inverse Wishart MAP with q = 2. Both MAPs 
are based on priors with \I/ = al. 

We consider two types of covariance matrices: The first is S = I, the 
second is a diagonal matrix with diagonal elements equal to 



a 2 r - 7 forl<z<p/10, 
V(jy-oV .i forp/10<z<p, 



which is illustrated in figure [3j Here there are a few large eigenvalues, but 
after a steep decrease the remaining eigenvalues are small and only decrease 
slowly. This covariance matrix is chosen to loosely mimic the behaviour of 
the eigenvalues in the real data example in the following section. We consider 
the behaviour of the MAP estimators under the quadratic loss function 

L 2 (S,S) = tr((S-S)(S-S) T ). 

The risk of the MLE and the inverse Wishart MAP can be calculated ex- 



plicitly (see Appendix A), but the risk of the power inverse Wishart MAP 
cannot, so we rely on simulations. We will give results for three choices of 
p, namely 10, 50 and 100. For each value of p, we will use n = p/2,p, 2p to 
investigate the behaviour in three different "asymptotic scenarios" . 

We first note that it is sufficient to consider diagonal matrices for S: For 
any orthonormal matrix V we have 
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10 20 30 40 50 



Figure 3: Eigenvalues of £ for p — 50 used in the simulation. 



Risk 


p = 10 p = 50 p = 100 


n = p/2 
n = p 
n = 2p 


18 98 198 
10 50 100 
5.25 25.25 50.25 



Table 1: Quadratic risk of the MLE. 



and since 



V T SV = 1 V ' sv 



2(n + p + 2m + 1) 



+ ((V^SVr + 8 n + P+ ^ m + 1 al) 1/2 ) 

n 2 J 



for the 2-power inverse Wishart MAP (and with a similar result for the 
inverse Wishart MAP), the risks are left unchanged by rotations. 



In our simulations, we choose a 2 in (22), such that the risks of the MLE 
for given values of p and n are the same in the two examples; see table [T] 
for the values of these risks. We do not vary the variance parameter a 2 in 
the simulations, because increasing a 2 will give the same results as keeping 
it fixed while lowering the floor and scaling the resulting risks. Thus it is 
sufficient to vary the floor. 
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When comparing the two MAPs, the choice of hyperparameters is cru- 
cial: By choosing suitably different hyperparameters we can easily make one 
MAP looks superior to the other. To avoid this we try to choose the hyperpa- 
rameters of the inverse Wishart prior so that the two MAP estimators have 
the same floor and the same overall amount of shrinkage. We believe that 
a reasonable comparison should use the same floor. However, if we use the 
same floor and the same shrinkage factor, then the regularization curve for 
the power inverse Wishart MAP (14) will be below the regularization curve 
for the inverse Wishart MAP (13) and our simulation results would be more 
a consequence of different amounts of shrinking rather than of the difference 
between the estimators. In order to circumvent this effect, we write (13) and 
(14) as a' A + b' and aX + ay/X 2 + b respectively. Here a and b are functions 
of the the chosen floor and shrinkage of the 2-power inverse Wishart MAP, 
and b' is just the common value of the floor. For chosen values of floor and 
shrinkage for the power inverse Wishart MAP, we choose a', such that 

0=^ (a\ + aVX 2 + b'-a'X- &') dX 

for a suitable value of L. Using L = oo leads to a' = 2a, i.e. the same 
shrinkage factor for the MAPs, so we need to choose a finite value of L. 
We choose L equal to the 99%-quantile in the distribution of the largest 
eigenvalue of the MLE. In this way the two MAPs has the same "average 
regularization" over the plausible range of observed eigenvalues. 

The shrinkage factors of both MAPs are bounded by the fact that m > p. 
We use the maximal shrinkage factor for the power inverse Wishart MAP as 
well as factors 10% and 20% smaller. We also use three different values for the 
floor -0.8, 1, and 1.2- corresponding to the average value of the eigenvalues 
of S (to two decimal places for the matrix given by (22)) and values 20% 
smaller and larger. 

The results based on 5,000 simulations are given in table [2] and |3j The 
differences between the two MAPs are small compared to the improvement 
over the MLE (see table [I]). This is not unexpected. We have chosen the 
hyperparameters of the priors in order to make the MAP estimators as similar 
as possible, and all our simulations are in situations, where the MLE is not 
expected to work well. We see that choosing the floor equal to 1 typically 
leads to smaller risks. This is not surprising for the E = J case, where all 
eigenvalues are equal to 1. Indeed, in this case it is optimal to use a floor 
equal to 1 (a = n + m + p + 1) and shrink as much as possible (m — > oo). 
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floor 




0.8 






1 






1.2 




(p,n) 


shrink 


1 


0.9 


0.8 


1 


0.9 


0.8 


1 


0.9 


0.8 


(10,5) 


q=l 


0.34 


0.31 


0.30 


0.26 


0.20 


0.15 


0.97 


0.87 


0.79 




q=2 


0.62 


0.60 


0.57 


0.10 


0.08 


0.07 


0.21 


0.22 


0.23 


(10,10) 


q=l 


0.40 


0.33 


0.27 


0.67 


0.52 


0.39 


1.70 


1.48 


1.29 




q=2 


0.44 


0.43 


0.41 


0.13 


0.11 


0.09 


0.47 


0.45 


0.43 


(10,20) 


q=l 


0.72 


0.53 


0.37 


1.51 


1.17 


0.90 


3.01 


2.56 


2.16 




q=2 


0.30 


0.27 


0.25 




0.2£ 


0.20 


1.00 


0.91 


0.84 


(50,25) 


q=l 


1.40 


1.37 


1.36 


0.95 


0.75 


0.58 


4.44 


4.09 


3.76 




q=2 


3.12 


2.99 


2.86 


0.55 


0.46 


0.37 


1.07 


1.09 


1.13 


(50,50) 


q=l 


1.27 


1.13 


1.05 


2.08 


1.64 


1.26 


6.77 


6.05 


5.39 




q=2 


2.19 


2.11 


2.05 


0.5S 


0.5£ 


0.45 


2.32 


2.23 


2.15 


(50,100) 


q=l 


1.63 


1.22 


0.92 


4.22 


3.31 


2.54 


10.60 


9.24 


8.00 




q=2 


1.42 


1.31 


1.23 


1.56 


1.26 


1.00 


4.93 


4.53 


^ 


(100,50) 


q=l 


2.78 


2.71 




1.95 


1.54 


1.19 


9.00 


8.27 


7.60 




q=2 


6.23 


5.97 


5.72 


1.10 


0.P2 


0.75 


2.15 


2.19 


2.25 


(100,100) 


q=l 


2.52 


2.24 


2.07 


4.20 


3.31 


2.54 


13.65 


12.18 


10.85 




q=2 


4.38 


4.22 


4.09 


1.35 


1.12 


0.89 


4.63 


445 


4.29 


(100,200) 


q=l 


3.26 


2.42 


1.83 


8.45 


6.64 


5.09 


21.26 


18.54 


16.06 




q=2 


2.83 


2.60 


2.45 


3.11 


2.53 


2.00 


9.85 


5.05 


S.2S 



Tabic 2: Quadratic risk, S = I. Lines with g = 1 are for an inverse Wishart MAP, 
q = 2 for the power inverse Wishart MAP. The "shrink" is the factor multiplied onto the 
maximally possible shrinkage factor for the power inverse Wishart MAP. The smallest risk 
for each combination of (p, n) is given in bold; the smallest risk for each combination of 
(p, n) and floor and shrinkage is given in italics. 

But it is also the case for the more realistic example, where most of the 
true eigenvalues are smaller than 1. Thus, it seems overall beneficial to 
overestimate small eigenvalues to some extent. On the other hand, as one 
would expect, it is also clear in our simulations that a floor that is "too 
small" is preferably to one that is "too large" . 

For the values used here, more shrinkage (smaller values of the shrinkage 
factor) generally leads to smaller risk, regardless of the floor for the values 
used here. Obviously, there will be a limit to this effect: If the floor is too 
low or too high, too much shrinking will lead to higher risks due to estimates 
that are too small or too large. 

Overall the power inverse Wishart MAP performs better than the usual 
inverse Wishart MAP, when the floor is not too low. It should also be clear 
that we cannot conclude that the power inverse Wishart MAP is always 
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floor 




0.8 






1 






1.2 






shrink 


1 


0.9 


0.8 


1 


0.9 


0.8 


1 


0.9 


0.8 


(10,5) 


q=i 


0.38 


0.36 


0.34 


0.31 


0.25 


0.20 


1.02 


0.92 


0.84 




q=2 


0.67 


0.64 


0.62 


0.15 


0.13 


0.12 


0.26 


0.27 


0.27 


(10,10) 


q=l 


0.44 


0.36 


0.31 


0.70 


0.56 


0.44 


1.73 


1.51 


1.33 




q=2 


0.48 


0.47 


0.46 


0.18 


0.15 


0.13 


0.51 


0.49 


0.48 


(10,20) 


q=l 


0.74 


0.55 


0.41 


1.52 


1.20 


0.93 


3.04 


2.59 


2.20 




q=2 


0.33 


0.30 


0.29 


0.35 


0.29 


0.24 


1.04 


0.95 


0.88 


(50,25) 


q=l 


7.33 


7.42 


7.53 


6.95 


6.86 


6.79 


10.49 


10.24 


10.02 




q=2 


9.24 


9.19 


9.15 


6.73 


6.72 


6.70 


7.31 


7.40 


7.51 


(50,50) 


q=i 


6.38 


6.44 


6.56 


7.28 


7.02 


6.84 


12.02 


11.50 


11.02 




q=2 


7.53 


7.60 


7.69 


6.12 


6.14 


6.18 


7.83 


7.89 


7.95 


(50,100) 


q=l 


5.68 


5.55 


5.56 


8.36 


7.74 


7.27 


14.83 


13.74 


12.80 




q=2 


5.59 


5.72 


5.91 


5.88 


5.82 


5.82 


9.38 


9.20 


9.06 


(100,50) 


q=l 


25.11 


26.11 


26.55 


25.06 


25.09 


25.18 


32.24 


31.94 


31.71 




q=2 


29.81 


29.89 


29.98 


24.89 


25.01 


25.16 


26.08 


26.42 


26.81 


(100,100) 


q=i 


22.25 


22.76 


23.39 


24.22 


24.08 


24.04 


33.86 


33.12 


32.51 




q=2 


24.72 


25.19 


25.72 


22.09 


22.44 


22.80 


25.64 


26.01 


26.43 


(100,200) 


q=i 


18.81 


19.15 


19.73 


24.39 


23.71 


23.31 


37.48 


35.84 


34.49 




q=2 


18.32 


19.14 


20.04 


19.21 


19.61 


20.14 


26.43 


26.57 


26.82 



Table 3: Quadratic risk, X given by (22 1. Lines with q = 1 are for an inverse Wishart 
MAP, q = 2 for the power inverse Wishart MAP. The "shrink" is the factor multiplied onto 
the maximally possible shrinkage factor for the power inverse Wishart MAP. The smallest 
risk for each combination of (p, n) is given in bold; the smallest risk for each combination 
of (p, n) and floor and shrinkage is given in italics. 
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better than the usual inverse Wishart MAP. Along with the other hyperpa- 
rameters, the power q must be chosen by the data analyst. 



7. Application to real data 

We consider the data set analysed by Shepstone, Rogers, Kirwan, and 
Silverman [11] , who studied the intercondylar notch in human osteoarthritic 
and non-osteoarthritic femora. The authors considered 96 human femora 
from a large skeletal population. The femora were annotated by sex as well 
as distal eburnation. The available data is a sampling of a 2-dimensional 
spline curve approximation of the silhouette of the condyle in 50 arch length 
equidistant points normalised to start in (0,0) and end in (1,0). 

We only consider a part of the data set, namely the 21 condyles with 
signs of distal eburnation. One of these (marked " 2283R" in the data) differs 
markedly from the rest of the condyles (see figure [8]), and we omit it from 
the estimation procedure. Later we will use the estimated covariance matrix 
to find a prediction of this condyle treating the middle part as missing. In 
this application, n = 20 whereas p = 96 (two times 50 minus the two end 
points, which are fixed). 

In data like these, it would be natural to expect adjacent x (or y) co- 
ordinates to be highly correlated and distant x (y) coordinates to be less 
correlated, so we will let our choice of ^ reflect this. The x and y coordi- 
nates may also be correlated, but we expect this correlation to be smaller, 
and we are not sure of its sign and put this part of the hyperparameter \I/ 
equal to 0. Also for simplicity, we assume variance homogeneity in our prior 
even though it is clear from the fact that the outlines of the notches have 
been "tied down" at the ends, that there will be less variation near the ends 
than in the middle. These considerations lead to \I/ = a^o with 



AR(1) P 
ARff 



where AR(l) p is a correlation matrix for an AR(l)-process with parameter 
p, i.e. a matrix with (z, j)th element equal to p' 4 ~ J ', and is a matrix of 0s. 
Thus, we use the same correlation parameter for both x and y coordinates 
as well as assume variance homogeneity. This may be too simplistic, but 
without strong prior beliefs we prefer to keep ^ simple. We use a prior with 
q = 2 and m = p; larger values of q and m leads to smaller shrinkage factors, 
and with p considerably larger than n we expect that this will give a sufficient 
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MAP 



MLE 



Figure 4: Estimated variances, the MLE on the right, the MAP on the left. The grey lines 



in the plot on the left are the bounds from ( 19 ) 



amount of shrinkage. The values of p (= 0.94) and a (= 0.02535) are chosen 
by predictive cross validation [5J using importance sampling. 

Figure [4] shows how the estimated variances are lifted (by the floor) and 
shrunken, but also that the relative relationship between the variances are 
more or less unchanged. The MAP estimators of the large variances are much 
smaller than the MLEs, which of course is an effect of a and the shrinkage 
factor being fairly small; by Q the prior mode is located at 0.0021\t , o- The 
smaller variances are lifted, and the averages of the estimated variances (the 
traces of the estimators) are not markedly different (0.0021 for the MAP and 
0.0029 for the MLE). 

Turning next to the estimated correlation matrix (figure [5]), we see how 
the prior independence of x and y coordinates removes most of the correlation 
between x and y coordinates. The prior's AR(l)-structure is also evident in 
the correlations between x coordinates and between the y coordinates. 

The eigenvalues and the first four eigenvectors of the MAP and the MLE 
are shown in figures [6] and [7j We see that the prior lifts the eigenvalues; only 
the largest eigenvalue is smaller when estimated by the MAP, than when it 
is estimated by the MLE. Note that the y-axis in figure [6] is logarithmic, so 
that the difference between the largest eigenvalues of the two estimators is 
rather big. The eigenvalues of the MAP estimator are pairwise similar. This 
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MAP s MLE 




Figure 5: Estimated correlation matrices; MLE on the right, MAP on the left. The grey 
tone-bands are based on the 5%-, 10%-, . . . , 95%-quantiles of the elements of the two 
estimators. 

is probably an effect of the block-diagonal it tends to split the variation 
into a part mostly related to the x-coordinates and a part mostly related 
to the ^/-coordinates. This is also what we see from figure [7) Indeed it 
seems that the sinusoidal-looking eigenvectors of AR(l)-correlation matrices 
and the block diagonal form have a dominant effect on the resulting MAP 
estimator. 

Any application of MAP estimation is a compromise between the data and 
the prior: We wish to balance the information provided by the data with the 
stability introduced by the prior. It is not surprising that the prior has a large 
effect in this example. Even if we suspect that the true covariance matrix 
is more complicated, there hardly is any information in the data to help us 
discover it. The size of dataset is very small compared to the dimension of the 
unknown covariance matrix, so the shrinkage factor is quite small, and has 
a lot of weight in the resulting estimator. Though this is the intended effect 
of MAP estimation, it also means that the prior should be chosen carefully. 
In this example we have used a very simple choice of More complicated 
choices may be considered: Different variances for x- and ^/-coordinates, as 
well as correlation between x and y-coordinates are easily implemented in 
the estimator. However, choosing the values of the hyperparameters is more 
complicated. Our solution to this problem is basically a grid search, and 
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Figure 6: Eigenvalues of the MLE ('+') and the MAP (V); note that the y-axis is loga- 
rithmic. 





Figure 7: First 4 eigenfunctions of the estimators. The top row is the MAP, the bottom 
row is the MLE. 
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Figure 8: Prediction of the middle part of notch "2283R". The full line is the mean 
shape of the notches, the grey lines the observed notches. The circles and bullets outline 
notch "2283R" as it is in the data. We interpret the bullets as observed and the circles as 
missing. The prediction of the missing part is given by the pluses. 

the more parameters that need to be chosen, the longer the computation 
time. For this reason, we will not attempt a more complicated prior for this 
example. 

As mentioned at the beginning of this section the condyle "2283R" differs 
radically from the rest. As seen in figure [8j where it is represented by circles 
and bullets, it seems to have had its middle part "cut off", when compared 
to the other condyles in the dataset (grey curves in the figure). As an illus- 
tration we pretend that the middle part (the part of the condyle represented 
by the circles) are missing data and try to predict it. The usual EBLUP 
formula (see e.g. Anderson [TJ p. 37]) based on the MLE breaks down; there 
are 30 observed points (the bullets in the figure), so with only 20 fully ob- 
served condyles the covariance matrix corresponding to the observed part of 
" 2283R" is singular and cannot be inverted. The MAP, on the other hand, is 
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regular, and when using this in the formula, we obtain the prediction given 
in figure [8] by the pluses. 

8. Conclusion 

In this paper we have introduced a new class of distributions -the power 
inverse Wishart distributions- on the set of positive definite matrices. Used 
as priors for unknown covariance matrices of multivariate Gaussian data, they 
lead to easily calculable maximum a posteriori estimators. Our simulations 
suggest that the MAP estimators perform better than the MLE in terms of 
overall quadratic risk. We have derived some asymptotic properties of these 
estimators and have seen that these are as good as or in some situations even 
better than those of the MLE. 

As we have seen in sections [6] and [7] the choice of prior influences the MAP 
estimator. Obviously, if this was not the case, there would be little reason 
for using the MAP estimator. On the other hand, it also means that the 
prior should be chosen carefully. In section [7] we chose the form of the prior 
mode based on prior beliefs but the values of it was determined by cross val- 
idation. Our implementation of this cross validation is too computationally 
demanding to allow a further investigation of its properties, so it is difficult 
to know if this is in any sense optimal. Clearly, this is an area that requires 
additional work. 

It is quite easy to extend our results (except theorem [T]) to improper priors 
with m < p; in theorem [5] this would allow k to be any non-negative real. By 
allowing improper priors, we could obtain a MAP estimator in our example 
with less shrinkage than the one we have used. It is less obvious whether our 
results can be extended to values of q that are not positive integers, as many 
of our arguments rely on q being a positive integer. 

We hope that the additional flexibility provided by the power inverse 
Wishart MAP will prove to be useful when estimating large covariance ma- 
trices based on limited amounts of data. 
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Appendix A. Quadratic risk 

For estimators of the form 

t = aS + bl 

such as the MLE and the inverse Wishart MAP estimator, the expected 
quadratic risk is 

£[L 2 (E, £)] = tr (E [(aS + bl - £) 2 ]) 




+ 2a n -± (6tr(E-tr(S 2 ))) 
+ 6 2 p-26tr(S) +tr (S 2 ) 

as S is Wishart distributed with parameters (n — 1, S/n). 

The expression for _E[L 2 (S, £)] is a convex polynomial of (a, b) of degree 
two and thus has a minimal value. Thus, there are unique optimal values 
of the floor and the shrinkage for the inverse Wishart MAP (for a given S), 
but there are also choices that will lead to inverse Wishart MAPs with larger 
risks than the MLE. 
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